Part I - Dataset Exploration Airline On-Time Performance Data

by LARIANE Mohcene Mouad

Selecting a random sample from the dataset to work on

Since the original dataset is too large (more than 7 million rows), I will run my study on a random sample of 1.5 million records only. This sample was created in the Part_0_wrangling_and_creating_a_sub_sample_to_work_on.ipynb notebook, where I dealt with the following issues :

I will work on this sample of 1.5 million records instead of the original dataset in order to reduce execution time for this project.

Loading clean dataset

Fixing features types

Univariate Exploration

Question

When do people travel the most ?

Note: This dataset contains flight records for 2007 only, so there is no need to examine the year in the date feature

Sub-Question

In which month do people prefer to travel ?

Visualization

Observation

The most preferred months for travel are July and August, which was to be expected because people tend to travel a lot during the summer vacation. The second most preferred period is March, April and May, which makes sense because this is the spring season, when the weather is fine. The least preferred month is February, which also makes sense because it is a period when most people are either working or studying (February is also the shortest month, with 28 days in 2007, which lowers its total).

Sub-Question

On which days of the month do people prefer to travel ?

Observation

The number of flights decreases at the end of the month (from the 29th to the 31st), which suggests that people travel less during this period. The 31st needs separate treatment, since not all months have 31 days. Looking at it from another perspective, only 7 months have 31 days and we still have 29k flight records on the 31st, which means that people do actually travel often on that day. Therefore we can remove the 31st from the list of days on which people travel the least, which leaves only the 29th and 30th as the final result.
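The reasoning about the 31st can be made explicit by dividing each day-of-month total by the number of months that actually contain that day. This is a minimal sketch using only the standard library; the function names are hypothetical, and the only figure taken from the observation above is the 29k total for the 31st.

```python
import calendar

def months_containing_day(day, year=2007):
    """Count how many months of `year` have a day numbered `day`."""
    return sum(1 for month in range(1, 13)
               if calendar.monthrange(year, month)[1] >= day)

def flights_per_occurrence(flights_by_day, year=2007):
    """Divide each day-of-month total by the number of months containing that day."""
    return {day: total / months_containing_day(day, year)
            for day, total in flights_by_day.items()}

# Only the 29k total for the 31st comes from the observation above.
per_occurrence = flights_per_occurrence({31: 29_000})
```

In 2007 the 29th and 30th occur in only 11 months (February had 28 days), so the same normalization may be worth applying to those two days before concluding that people travel less on them.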

Sub-Question

On which days of the week do people prefer to travel ?

Visualization

Observation

Monday, Friday and Thursday are the most preferred days to travel. Friday and Monday sit on either side of the weekend, which explains why they are the two busiest days of the week. The high number of flight records on other weekdays such as Thursday and Wednesday is reasonable too, since people such as employees can also take business trips on those days.

Question

Which carriers dominate the airline market ?

Visualization

Observation

Southwest Airlines dominates the market with a 16% share of the flights, about twice the share of its closest competitor, American Airlines.
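A share like the 16% above can be computed directly from the flight records. The following is a minimal sketch, assuming the carrier code is stored in a column named UniqueCarrier; the helper name and the toy rows are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def carrier_share(flights, carrier_col="UniqueCarrier"):
    """Return each carrier's share of flight records, in percent, largest first."""
    return flights[carrier_col].value_counts(normalize=True).mul(100).round(2)

# Tiny made-up frame, only to show the call; not real data.
toy = pd.DataFrame({"UniqueCarrier": ["WN", "WN", "AA", "DL"]})
shares = carrier_share(toy)
```

Using `normalize=True` keeps the result independent of the sample size, which matters here because the analysis runs on a sub-sample rather than on the full dataset.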

source : Carrier Codes and Names

Question

Which location(s) do people travel from/to the most ?

Visualization

Observation

The three airports with the most outgoing and incoming number of flights are :

Therefore the locations that people travel from/to the most are :

Note: These three US cities are the three most active cities for air traffic in our records

Question

At what time do most flights take off ?

Visualization

Observation

Most flights take off between 6am and 7pm
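To count flights per hour of the day, the time values first have to be reduced to an hour. The sketch below assumes the times are still stored in the raw hhmm encoding (for example 1435 for 14:35); if the type-fixing step already converted them to a time type, this helper (a hypothetical name) is not needed. The toy values are made up.

```python
import pandas as pd

def hhmm_to_hour(times):
    """Convert hhmm-encoded times (e.g. 1435 for 14:35) to the hour of day."""
    hours = pd.to_numeric(times, errors="coerce") // 100
    # 2400 is sometimes used for midnight; fold it back to hour 0.
    return (hours % 24).astype("Int64")

# Made-up values, including a missing time (e.g. a cancelled flight).
toy = pd.Series([1435, 600, None, 2400])
hours = hhmm_to_hour(toy)

# Number of flights per hour, in chronological order; missing times are skipped.
flights_per_hour = hours.value_counts().sort_index()
```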

Question

At what time are most flights scheduled to take off ?

Visualization

Observation

Most flights are scheduled to take off between 6am and 7pm

Question

At what time do most flights land/arrive ?

Visualization

Observation

Most flights land/arrive between 6am and midnight

Question

At what time are most flights scheduled to land/arrive ?

Visualization

Observation

Most flights are scheduled to arrive between 7-8am and 11pm

Question

What distance do these flights cover to reach their destination ?

Visualization

Taking a closer look at distance values greater than 3000 miles

Observation

The histogram is right skewed: most distances are between 100 and 3000 miles, and a relatively small number of flights (2214) cover a distance greater than 3000 miles

Question

How long do flights take (Actual Elapsed Time) to reach their destination ?

Visualization

Observation

Normal distribution, most flights take between 30 and 300 minutes (half an hour to 5 hours) to reach their destination

Taking a closer look at flights with an Actual Elapsed Time greater than 600 minutes

Visualization

Observation

118 flights take from 600 minutes (10 hours) to 700 minutes (about 11.7 hours), and 2 flights take from 700 minutes to 800 minutes (about 13.3 hours) to reach their destination. The origin and destination of these flights are very far away from each other, which explains why they take this long. Such long flights may also include several stopovers.

Observation

The CVG, LAN flight and the LAX, HNL flight are the two longest flights (outliers)

Question

How long are flights scheduled to take (Scheduled Elapsed Time) to reach their destination ?

Visualization

It looks like we have some negative values, where we are supposed to have only positive values

It looks like the CRSElapsedTime value for this row is inaccurate and must be dropped
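Dropping such a row can be done with a boolean filter. This is a minimal sketch of one way to do it, not necessarily the exact step used in this notebook; the helper name and the toy values are hypothetical.

```python
import pandas as pd

def drop_nonpositive_scheduled_time(flights, col="CRSElapsedTime"):
    """Keep rows whose scheduled elapsed time is positive or missing."""
    keep = flights[col].isna() | (flights[col] > 0)
    return flights.loc[keep].copy()

# Made-up values, only to show the call; not real data.
toy = pd.DataFrame({"CRSElapsedTime": [95.0, -21.0, None, 130.0]})
cleaned = drop_nonpositive_scheduled_time(toy)
```

Missing values are kept on purpose, so that the filter removes only the impossible non-positive durations and nothing else.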

Scaling the x axis for a better plot

Observation

Normal distribution, where most flights have a scheduled elapsed time between 30 and 300 minutes (half an hour to 5 hours)

Observation

Only 2817 flights have a scheduled elapsed time of less than 30 minutes; these flights cover an average distance of about 113 miles
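The count and the average distance of these short flights can be obtained with a single filter. A minimal sketch, assuming the columns are named CRSElapsedTime and Distance; the helper name and the toy rows are hypothetical.

```python
import pandas as pd

def short_flight_summary(flights, max_minutes=30):
    """Count flights scheduled under `max_minutes` and average their distance."""
    short = flights[flights["CRSElapsedTime"] < max_minutes]
    return len(short), short["Distance"].mean()

# Made-up rows, only to show the call; not real data.
toy = pd.DataFrame({
    "CRSElapsedTime": [25, 28, 95, 240],
    "Distance": [100, 120, 500, 1800],
})
n_short, mean_distance = short_flight_summary(toy)
```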

Only EWR, HNL and ATL, HNL take about 600 minutes (10 hours) to reach their destination, which makes sense because the origin cities of these two flights are fairly close to each other (Newark and Atlanta, both in the US) and their destination is the same (Hawaii).

Observation

The ORD, MSN flight has the longest scheduled time in our records (outlier), even though these two cities are very close to each other (both are in the US and the distance between them is 109 miles) and the flight had no delays. Therefore the CRSElapsedTime value is wrong (inaccurate) and should be fixed/dropped.

Question

How long do flights take in the air (AirTime) to reach their destination ?

Visualization

Observation

Most flights spend less than 200 minutes (about 3.3 hours) in the air and are considered short flights. A considerable number of flights also spend between 200 and 400 minutes (about 3.3 to 6.7 hours) in the air

Observation

5903 flights spend more than 400 minutes (about 6.7 hours) in the air

Observation

The LAX, HNL flight has the longest AirTime, which confirms the results found when exploring the ActualElapsedTime feature

Question

How many flights arrive late/early ?

Visualization

Observation

Normal distribution, which means that flights arrive late and early almost equally, where most values are between -50 minutes (a 50-minute early arrival) and 100 minutes late

Applying a logarithmic transformation on the x axis
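A logarithmic axis cannot show zero or negative values, so one way to apply it to delays is to treat early and late flights separately and to build logarithmically spaced bin edges for each group. The sketch below shows that idea with NumPy; the helper name, the bin count and the toy delays are hypothetical.

```python
import numpy as np

def log_bin_edges(values, n_bins=30):
    """Logarithmically spaced bin edges spanning the positive values."""
    values = np.asarray(values, dtype=float)
    positive = values[values > 0]
    return np.geomspace(positive.min(), positive.max(), n_bins + 1)

# Made-up delays in minutes: negative = early, positive = late.
delays = np.array([-30, -5, 0, 5, 60, 300])

late = delays[delays > 0]
early = -delays[delays < 0]  # flip the sign so early arrivals become positive

late_edges = log_bin_edges(late, n_bins=4)
early_edges = log_bin_edges(early, n_bins=4)
late_counts, _ = np.histogram(late, bins=late_edges)
```

The resulting edges can then be passed to a histogram plot whose x axis uses a log scale, once for the early group and once for the late group.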

Observation

Early arrival : flights tend to arrive 5-30 minutes earlier than the Scheduled arrival time

Late arrival : right skewed histogram, flights tend to arrive 5-300 minutes late

Outliers

Doing some stats

Observation

Only 2.74% of the flights arrive exactly on time, which indicates that flights rarely arrive exactly on schedule. The number of flights with an early arrival is slightly higher than the number of flights with a late arrival.
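Percentages of this kind can be computed by classifying each delay by its sign. A minimal sketch, assuming delays are expressed in minutes with negative values meaning early; the helper name and the toy values are hypothetical.

```python
import pandas as pd

def punctuality_shares(delay_minutes):
    """Percentage of flights that are early (< 0), on time (== 0) or late (> 0)."""
    delays = pd.Series(delay_minutes, dtype="float64").dropna()
    shares = {
        "early": (delays < 0).mean() * 100,
        "on time": (delays == 0).mean() * 100,
        "late": (delays > 0).mean() * 100,
    }
    return pd.Series(shares).round(2)

# Made-up delays, including one missing value (e.g. a cancelled flight).
toy_shares = punctuality_shares([-10, -3, 0, 12, None])
```

Missing delays are dropped before computing the shares, so the three percentages add up to 100 over the flights that actually have a recorded delay.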

Question

How many flights leave late/early ?

Visualization

Observation

Slightly right skewed, which means that the number of flights that leave early is slightly higher than the number of flights that leave late

Applying a logarithmic transformation on the x axis

Observation

Early departure : Unimodal distribution, where flights leave 1-30 minutes early at most

Late departure : Asymmetric unimodal distribution, where flights leave 1 minute to 5 hours late

Doing some stats

Observation

Only 8.41% of the flights leave exactly on time, which indicates that flights rarely leave on schedule. The number of flights with an early departure is slightly higher than the number of flights with a late departure.

Question

How many flights were cancelled ?

Important Note: I did the data wrangling in another notebook, where I made a mistake when replacing boolean values in the dataset. I won't be able to fix the code in that notebook, because the sample that we're working on has already been created: fixing that code and re-running the notebook would create a completely different random sample, with potentially different results. Therefore, to solve this simple problem, I'll replace the values in this notebook instead and carry on with my analysis.

Fixing the issue

Test

Visualization

Observation

Only 2.17% of the flights were cancelled

Question

How many flights were diverted ?

Important Note: Same as the previous issue

Fixing the issue

Test

Visualization

Observation

Only 0.23% of the flights were diverted

Question

What is the most common reason for cancelling flights ?

Renaming values :

A = carrier, B = weather, C = NAS, D = security
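The renaming above amounts to a simple lookup. A minimal sketch with pandas, assuming the codes live in a column named CancellationCode; the column name, the helper name and the toy values are assumptions.

```python
import pandas as pd

# Mapping taken from the line above.
CANCELLATION_REASONS = {"A": "carrier", "B": "weather", "C": "NAS", "D": "security"}

def label_cancellation_codes(codes):
    """Replace the one-letter cancellation codes with readable labels."""
    return codes.map(CANCELLATION_REASONS)

# Made-up values; flights that were not cancelled have no code.
toy = pd.Series(["A", "B", None, "D"], name="CancellationCode")
labelled = label_cancellation_codes(toy)
```

`Series.map` leaves missing codes as missing, so flights that were not cancelled do not receive a spurious label.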

Visualization

Observation

Almost all cancellations are because of weather conditions or the carrier. Only around 6000 cancellations are the result of the National Aviation System (NAS) situation, and only 7 flights were cancelled for security reasons.

The rest of the features

Observation:

All the previous features are similar: almost all of them are short (a couple of minutes), and only in special cases do they take longer (a couple of hours, maybe)

Question

How long do flights take to park or to take off ?

Visualization

Observation:

Normal distribution, almost all flights take from 2 minutes to 1 hour to park

Observation:

Normal distribution, almost all flights take from 3 minutes to 1.5 hours to take off

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

ActualElapsedTime and CRSElapsedTime :

ArrDelay/DepDelay :

Cancelled/Diverted :

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Flight date/time :

Distance : most distances are between 100 and 3000 miles, and a relatively small number of flights (2214) cover a distance greater than 3000 miles

Bivariate Exploration

Question

What's the relationship between the Departure delay and Arrival delay ?

Visualization

Observation

There's a correlation between the departure and arrival delays: in almost all flights, the longer the departure delay, the longer the arrival delay. Notice that there's one outlier point that we must check
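The strength of this relationship can be quantified with a correlation coefficient in addition to the scatter plot. A minimal sketch, assuming the DepDelay and ArrDelay columns; the helper name and the toy values are hypothetical.

```python
import pandas as pd

def delay_correlation(flights, x="DepDelay", y="ArrDelay"):
    """Pearson correlation between two delay columns, ignoring missing rows."""
    pair = flights[[x, y]].dropna()
    return pair[x].corr(pair[y])

# Made-up delays in minutes, only to show the call; not real data.
toy = pd.DataFrame({
    "DepDelay": [0, 10, 20, 30],
    "ArrDelay": [-5, 8, 18, 33],
})
r = delay_correlation(toy)
```

A value of `r` close to 1 would support the observation that arrival delay grows with departure delay. Pearson correlation is sensitive to extreme points, so an outlier such as the one mentioned above can shift it noticeably.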

The delay of this flight is more than 2500 minutes because of a carrier delay. Even though this row is considered an outlier I won't drop it, because I will need the delay values and causes when doing my analysis on the delay-related features

Question

What are the farthest destinations that people often travel to ?

Visualization

Observation

Airports/cities such as PSE, SJU and BQN are among the farthest destinations that people travel to

Question

What are the closest destinations that people often travel to ?

Visualization

Observation

Airports/cities such as TYR, ACT and EAU are among the closest destinations that people actually bother to travel to by plane

Question

Visualization

Observation

Notice the correlation between the distance and the Actual Elapsed Time: the longer the distance, the longer the flight takes to reach its destination, which makes sense. However, there are two flights with a very short distance that took a very long time to reach their destinations; these flights must be examined.

As expected, these flights took all this time to reach their destinations because of some long delays before and during the flight.

Question

Which carriers usually operate long/short flights ?

Visualization

CO, B6, UA and AA usually handle long-distance flights, whereas carriers such as AQ, OO, MQ and YV handle short-distance flights

Question

Which carriers are the best/worst in terms of flight delay ?

Visualization

Observation

The carriers with the longest average delay per flight are EV and B6. These carriers are considered the worst because on average they are 15-20 minutes behind schedule. On the other hand, the best carriers are HA and AQ, with an average delay of only about 3 minutes per flight

Question

Which carriers have the highest/lowest flight cancellation ratio ?

Visualization

Observation

The carriers with the highest flight cancellation ratio are MQ, OH and YV, where almost 4% of their entire flights were cancelled. On the other hand, carriers such as HA and F9 have the lowest cancellation ratio, with less than 0.5% of their flights being cancelled.
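A per-carrier ratio like this is the mean of a boolean flag within each carrier group. A minimal sketch, assuming a UniqueCarrier column and a Cancelled flag with no missing values; the helper name and the toy rows are hypothetical.

```python
import pandas as pd

def rate_by_carrier(flights, flag_col, carrier_col="UniqueCarrier"):
    """Percentage of each carrier's flights for which `flag_col` is true."""
    # Assumes the flag has no missing values (NaN would be cast to True).
    flags = flights[flag_col].astype(bool)
    rates = flags.groupby(flights[carrier_col]).mean() * 100
    return rates.sort_values(ascending=False)

# Made-up rows, only to show the call; not real data.
toy = pd.DataFrame({
    "UniqueCarrier": ["MQ", "MQ", "MQ", "MQ", "HA", "HA"],
    "Cancelled": [1, 0, 0, 0, 0, 0],
})
rates = rate_by_carrier(toy, "Cancelled")
```

Dividing by each carrier's own number of flights, rather than comparing raw counts, keeps large carriers from looking worse simply because they fly more. The same helper could be called with a Diverted flag.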

Question

Which carriers have the highest/lowest flight diversion ratio ?

Visualization

Observation

The carriers with the highest flight diversion ratio are XE, B6 and CO, where about 0.35% of their entire flights were diverted. On the other hand, carriers such as AQ and HA have the lowest flight diversion ratio, with less than 0.05% of their flights being diverted.

Question

What are the reasons behind cancelling the flights for each carrier ?

Visualization

Observation

Almost all carriers cancel their flights for two main reasons : carrier issues and weather conditions, followed by cancellations because of the NAS (National Aviation System). Cancellations for security reasons rarely happen.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

DepDelay and ArrDelay :

Dest and Distance :

Distance and ActualElapsedTime :

UniqueCarrier and Distance :

UniqueCarrier and Delay features :

UniqueCarrier and Cancelled :

UniqueCarrier and Diverted :

UniqueCarrier and Cancellation code :

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I didn't perform any bivariate visualizations on these features

Multivariate Exploration

Question

What is the relationship between the distance and the mean delay time for each carrier ?

Visualization

Observation

I noticed that there isn't really a relationship between the distance and the delay. The delay depends on the carrier, in other words on how the carrier's planes and staff deal with long/medium/short distances. Carriers such as EV and B6 always suffer from long delays on their flights (a mean of 18-19 minutes per flight), whether long or short, which indicates that these carriers are having problems managing their flights. These carriers are considered the worst carriers in terms of delay time.

However, carriers such as HA and AQ mostly handle short-distance flights (at most 600 miles long) and make sure that their flights take off and arrive on time. These carriers happen to be the best choice for short-distance flights, since they are doing a good job handling them.

For the rest of the carriers, we can see that each carrier handles a particular type of flight: some carriers usually handle long-distance flights and others handle short-distance flights. These carriers are doing an acceptable job handling and managing their flights (the mean delay time per flight is 12-16 minutes)
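The per-carrier figures discussed above can be summarised in one table with the mean distance and the mean delay of each carrier. This is a minimal sketch, assuming the columns UniqueCarrier and Distance and using ArrDelay as the delay measure; which delay column the plot is based on is an assumption, and the helper name and toy rows are hypothetical.

```python
import pandas as pd

def carrier_profile(flights, delay_col="ArrDelay"):
    """Mean distance and mean delay per carrier, sorted by mean delay."""
    return (
        flights.groupby("UniqueCarrier")
        .agg(mean_distance=("Distance", "mean"),
             mean_delay=(delay_col, "mean"))
        .sort_values("mean_delay", ascending=False)
    )

# Made-up rows, only to show the call; not real data.
toy = pd.DataFrame({
    "UniqueCarrier": ["EV", "EV", "HA", "HA"],
    "Distance": [400, 500, 100, 300],
    "ArrDelay": [20, 18, 2, 4],
})
profile = carrier_profile(toy)
```

Plotting `mean_distance` against `mean_delay` with one point per carrier is one way to check whether the two quantities move together.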

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Some carriers usually handle long-distance flights, whereas other carriers handle short-distance flights, so there's a relation between the distance and the carrier. However, the delay time depends on how each carrier manages its flights.

Were there any interesting or surprising interactions between features?

Yes, some carriers specialise in a particular type of flight (short/medium/long), and these carriers are doing a fine job handling this type of flight. However, other carriers seem to have some problems managing all their flights in general and always suffer from delays. On the other hand, carriers that handle various types of flights seem to have an acceptable average delay time

Conclusions

During this analysis, I managed to figure out when people often travel, which flights are more likely to be cancelled or delayed and the reasons behind these cancellations, and how well the carriers are handling and managing their flights. These are some of the main findings :

About the passengers and airports

About the flights

About carriers